Parallel Entity Resolution with Dedoop


Similar Resources

Graph-Parallel Entity Resolution using LSH & IMM

In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison loa...

Dedoop: Efficient Deduplication with Hadoop

We demonstrate a powerful and easy-to-use tool called Dedoop (Deduplication with Hadoop) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports a browser-based specification of complex ER workflows including blocking and matching steps as well as the optional use of machine learning for the automatic generation of match classifiers. Specified workflows are automatically t...

P-Swoosh: Parallel Algorithm for Generic Entity Resolution

Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicated records that refer to the same real-world entity (match process), and derives composite information about the entity (merge process). Additionally, the merged record can recursively match other records. Since the ER process is typically compute-intensive, it is import...

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record linkage te...

Entity Resolution with Evolving Rules

Entity resolution (ER) identifies database records that refer to the same real world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic “evolves” frequently. A naïve approach that re-runs ER from scratch may not be tolerable for resol...

Journal

Journal title: Datenbank-Spektrum

Year: 2012

ISSN: 1618-2162,1610-1995

DOI: 10.1007/s13222-012-0110-x